Add ComputeDomain for running multi-node workloads #225
38065b4 to 3e51cd8
4ce9bdb to 0d435d8
Signed-off-by: Kevin Klues <[email protected]>
For now, just mark one well-known error as permanent. Future commits will abstract this better and mark more errors as permanent. Signed-off-by: Kevin Klues <[email protected]>
1e6a587 to a45c238
df90001 to 764c3c7
// ComputeDomainSpec provides the spec for a ComputeDomain.
type ComputeDomainSpec struct {
	NumNodes int `json:"numNodes"`
	Channel *ComputeDomainChannelSpec `json:"channel"`
Should channel be optional, with non-IMEX use cases in mind for the future? I know we are currently focused solely on IMEX support, but if we want to carry the ComputeDomain concept forward, we might encounter clusters without IMEX (channels).
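One way the optional-channel idea could look is sketched below. This is only an illustration of the question raised in the comment, not the change made in this PR: it assumes the pointer field is tagged `omitempty` and annotated with the conventional `+optional` marker, so that a nil `Channel` means "no IMEX channel requested". The `Name` field on `ComputeDomainChannelSpec` and the helpers `WantsChannel`, `encode`, and `decode` are hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComputeDomainChannelSpec is a stand-in for the real type; the Name field
// is hypothetical, since the excerpt does not show the actual fields.
type ComputeDomainChannelSpec struct {
	Name string `json:"name"`
}

// ComputeDomainSpec mirrors the struct under review, with Channel made
// optional (an assumption, not what the PR does).
type ComputeDomainSpec struct {
	NumNodes int `json:"numNodes"`
	// Channel is nil when no IMEX channel is requested.
	// +optional
	Channel *ComputeDomainChannelSpec `json:"channel,omitempty"`
}

// WantsChannel reports whether the spec asks for an IMEX channel.
func (s ComputeDomainSpec) WantsChannel() bool {
	return s.Channel != nil
}

// encode serializes a spec to JSON.
func encode(s ComputeDomainSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decode parses a spec from JSON; a missing "channel" key leaves Channel nil.
func decode(raw string) (ComputeDomainSpec, error) {
	var s ComputeDomainSpec
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	withoutChannel, _ := encode(ComputeDomainSpec{NumNodes: 2})
	fmt.Println(withoutChannel) // {"numNodes":2}

	withChannel, _ := encode(ComputeDomainSpec{
		NumNodes: 2,
		Channel:  &ComputeDomainChannelSpec{Name: "example"},
	})
	fmt.Println(withChannel) // {"numNodes":2,"channel":{"name":"example"}}
}
```

With this shape, consumers would branch on `WantsChannel()` instead of assuming the field is always set, which is the cost of making it optional.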
@@ -0,0 +1,89 @@
/*
 * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Maybe do what some clothing brands do: "Since 1987".
But I am not a lawyer; maybe the year in the license header has a deeper legal meaning.
@@ -0,0 +1,49 @@
/*
 * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
Is DRA really that old (2022)?
764c3c7 to 474f968
    tail -f /dev/null & wait
fi
/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
tail -n +1 -f /var/log/nvidia-imex.log & wait
Discussed in a sync meeting a while ago: here we give up control of the IMEX daemon process (do we? how does errexit behave when a daemonized process exits non-zero?). In any case, for robustness and debuggability it will be good to actively monitor the health of the IMEX daemon process (polling the process, or better: getting a health signal actively and straight from the process). I'd like to look into that at some point, after merge.
This is still a problem. If the daemon crashes, we will not exit (but the liveness probe will eventually fail and the pod will be restarted). We should make it more robust as a follow-up (probably by not doing everything in bash but instead writing a small Go utility).
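A rough sketch of the "small Go utility" idea follows. It is not the planned implementation: it only shows the general pattern of starting a child process, waiting on it, and propagating its exit code so that a crash terminates the container instead of going unnoticed. It assumes the supervised program stays in the foreground; whether `nvidia-imex` can be run that way is not established in this thread, so the daemon would need either such a mode or a different health signal. The function name `runAndWait` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runAndWait starts the command, forwards its output, blocks until it exits,
// and returns its exit code. A start failure is returned as an error.
// If the child is killed by a signal, ExitCode reports -1.
func runAndWait(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		return -1, fmt.Errorf("starting %s: %w", name, err)
	}

	err := cmd.Wait()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The child ran but exited unsuccessfully.
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, fmt.Errorf("waiting for %s: %w", name, err)
	}
	return 0, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: supervise <command> [args...]")
		return
	}
	code, err := runAndWait(os.Args[1], os.Args[2:]...)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Propagate the child's exit status so the container exits with it.
	os.Exit(code)
}
```

Used as the container entrypoint, for example `supervise sh -c 'exit 3'` exits with status 3, so the pod would restart as soon as the child fails, without waiting for a liveness probe to time out.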